Immunotherapy datasets analyses

This part will clearly describe how to analyze APS, TMB and TIGS in immunotherapy datasets, including cancer type level and individual level. Raw data used for these analyses have been preprocessed, detail please read preprocessing part or TCGA pan-cancer analyses of this analysis report.

TIGS and pan-cancer objective response rate to PD-1 inhibition

Previous study has shown that TMB could predict pan-cancer ICI objective response rates (ORR). Here we will evaluate and compare the predictive power of APS, TIGS with TMB in pan-cancer ICI objective response rates prediction. The ORR for anti–PD-1 or anti–PD-L1 therapy will be plotted against the corresponding median APS, TIGS, TMB across multiple cancer types.

Through an extensive literature search, we identified 26 tumor types or subtypes for which data regarding the ORR are available. For each tumor type, we pooled the response data from the largest published studies that evaluated the ORR. We included only studies of anti–PD-1 or anti–PD-L1 monotherapy that enrolled at least 10 patients who were not selected for PD-L1 tumor expression (Identified individual studies and references are available in the manuscript Supplementary Table 3). The median tumor mutational burden for each tumor type was obtained from a validated comprehensive genomic profiling assay performed and provided by Foundation Medicine. The APS information for 23 tumor types were calculated based on TCGA datasets, and the APS for merkel cell carcinoma, cutaneous squamous cell carcinoma and small-cell lung cancer were calculated based on GEO microarray datasets.

Combination of datasets

In code implementation, we firstly combine APM score, TIGS score from TCGA studies and GEO datasets.

Using the combined dataset df_all, we can normalize APM score to [0, 1] region, then we calculate median APM score for each tumor type.

Lastly, we merge this data to 26 tumor types or subtypes for which data regarding the ORR, TMB are available by hand using Excel. This process has been double checked. With the merged data, we can explore association of ORR and APM score, ORR and TMB etc. in immunotherapy datasets.

Significant correlation between APS, TMB and the ORR

We plot ORR against APS, TMB respectively, and fit their relationship with linear model. We observe that significant correlation between APS, TMB and the ORR.

Load pan-cancer ORR data and show it as a table.

Focus on cancer type with APS, fit data with linear model.

Of note, here we use log(TMB) will not change the relationshipe between ORR and TMB.

We observe that 28% variance of ORR can be explained by APS, and 44% variance of ORR can be explained by TMB. The result of TMB is similar to NEJM paper Tumor Mutational Burden and Response Rate to PD-1 inhibition. This result show that TMB and APS are biomarker that predicting ORR of immunotherapy, TMB is better than APS.

Next we plot the linear model.

TIGS definition and performance

From the aspect of biological process, TMB release is independent from processing and presentation of tumor antigens. Now that we have found TMB and APM are biomarkers which can predict object reponse rate of cancer type in immunotherapy, these two factors are independent, and are both required for tumor immunogenicity determination, why not integrate them as a new biomarker which should be better than TMB?

Therefore, theoretically, tumor immunogenicity can be represented as

\[ [“Tumor \quad antigenicity”] \times [“Antigen \quad processing \quad and \quad presenting \quad status”]. \]

We use log(TMB) here because log(TMB) has been proved to have a linear relationship with ORR. So the formula can be written as

\[ TIGS = APS_{normalized} \times log(TMB) \]

However, some tumors have TMB level below 1 mutation/ Mb, to avoid minus number in quantifying “tumor antigenicity”, we add number 1 to all normalized TMB. So in this case, the TIGS formula is

\[ TIGS = APS_{normalized} \times log(TMB + 1) \]

When TMB < 1 mutation/Mb, the log(TMB) would less than 0, if APS is very big, we would get a big minus number in this case, which against common sense.

Now we construct a linear model by integrate TMB and APM.

We observe that 66% variance of ORR can be explained by TIGS, this result is indeed bettern than TMB.

More exploration of TIGS formula

Would other combination of TMB and APS be better?

As mentioned above, we constructed the formula of TIGS according to our understanding of “tumor antigenicity”. Is there a better combination of these two factors?

There are three basic models we take into consideration.

  • Model 1: \(ORR = d \times (a \times APS + b\times log(TMB)) + c\), i.e. \(TIGS = a \times APS + b \times log(TMB)\)
  • Model 2: \(ORR = a \times (APS \times log(TMB)) + b\), i.e. \(TIGS = APS \times log(TMB)\) (This is the model we used above)
  • Model 3: \(ORR = e \times (a \times APS + b\times log(TMB) + c\times APS \times log(TMB)) + d\), i.e. \(TIGS = a \times APS + b\times log(TMB) + c\times APS \times log(TMB)\)

a, b, c, d and e here are coefficients. In R, we don’t need to write them in models, it will take care of them.

Now, let’s observe the results.

# Model 1
lm(Pool_ORR ~ log(Pool_TMB)+Pool_APM, data = sm_data)  %>% summary() 
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB) + Pool_APM, data = sm_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -12.628  -7.138  -1.461   3.109  29.314 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -20.490      7.375  -2.778 0.010686 *  
#> log(Pool_TMB)   12.172      2.761   4.409 0.000203 ***
#> Pool_APM        39.417     12.643   3.118 0.004840 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 10.16 on 23 degrees of freedom
#> Multiple R-squared:  0.6101, Adjusted R-squared:  0.5762 
#> F-statistic: 17.99 on 2 and 23 DF,  p-value: 1.979e-05
# Model 2
lm(Pool_ORR ~ log(Pool_TMB):Pool_APM, data = sm_data)  %>% summary() 
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB):Pool_APM, data = sm_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -12.382  -6.262  -2.004   2.911  25.404 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)              -1.857      3.496  -0.531      0.6    
#> log(Pool_TMB):Pool_APM   25.092      3.707   6.769 5.31e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9.34 on 24 degrees of freedom
#> Multiple R-squared:  0.6563, Adjusted R-squared:  0.642 
#> F-statistic: 45.83 on 1 and 24 DF,  p-value: 5.306e-07
# Model 3
lm(Pool_ORR ~ log(Pool_TMB)*Pool_APM, data = sm_data)  %>% summary() 
#> 
#> Call:
#> lm(formula = Pool_ORR ~ log(Pool_TMB) * Pool_APM, data = sm_data)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.0233  -6.8783  -0.4445   3.4839  25.8232 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)               12.24      16.85   0.726   0.4753  
#> log(Pool_TMB)            -12.93      12.08  -1.071   0.2959  
#> Pool_APM                 -23.41      31.80  -0.736   0.4693  
#> log(Pool_TMB):Pool_APM    46.72      21.96   2.127   0.0449 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 9.463 on 22 degrees of freedom
#> Multiple R-squared:  0.6766, Adjusted R-squared:  0.6325 
#> F-statistic: 15.34 on 3 and 22 DF,  p-value: 1.314e-05

The model 1 is the worst one. Model 3 is a little better than model 2, however model 2 is much simpler (model 2 is included in model 3!). Following Occam’s razor, we choose model 2 when compare it with model 3. We also choose model 2 when compare it with model 1 because it is better and consist with the fact APM and TMB are relatively independent but not absolutely independent. To summary, model 2 is the best option.

Any promotion of formula \(TIGS = APS \times log(TMB + 1)\)?

We still have a question about TIGS formula. The TMB in formula is calculated as

\[ TMB = ln(\frac{whole \quad exome \quad mutation \quad number}{38} + 1 ) \]

We choose nonsynonymous mutations because they have functional effects. Only functional effects on protein will generate neo-antigen (peptide).

We use 38 here because 38 MB is the estimated length of exome, so we can normalize TMB to per MB. However, in math model, there may be a better value “a” than 38. We don’t know yet, so we need to explore this “a” value.

We simulate “a” value from 0.1 to 10000 with step size 0.1 in the following formula, the resulted TIGS is then used to fit a linear model for pan-cancer ICI ORR prediction. We calculate differnt R Square value under each “a” and see how it changes.

\[ TIGS = APS_{normalized} \times ln(\frac{whole \quad exome \quad mutation \quad number}{a} + 1) \]

Implementation:

Plot the simulation result.

We can find that R squares of linear model have a maximum value when “a” is around 40. Therefore, “a=38” is optimized to rescale the raw whole exome mutation counts and balance the contribution for APM and TMB in TIGS formula.

Performance comparison on immunotherapy clinical response prediction and survival benefit

Description of datasets please see preprocessing part or “Method Section” of manuscript.

TMB is a well-known biomarker and has been applied. Considering our TIGS is based on TMB, thus the most important point we wanna do in this section is comparing the prediction/clinical perfermance between TMB and TIGS.

However, some other biomarkers can predict immunotherapy response have been studied by other groups. Therefore, this section we will include following biomarkers in comparsion.

  • APS - antigen processing and presenting efficiency score
  • TMB - tumor mutation burden
  • TIGS - tumor immunogenicity score
  • IIS - immune infiltration score
  • IFNG - interferon gamma response biomarkers of 6 genes including IFNG, STAT1, IDO1, CXCL10, CXCL9, and HLA-DRA
  • CD8 - gene expression level of CD8A + CD8B
  • PDL1 - an immunohistochemistry (IHC) biomarker approved by FDA. This study we use PD-L1 gene expression as the IHC surrogate
  • TIDE - TIDE signature which integrates the expression signatures of T cell dysfunction and T cell exclusion to model tumor immune evasion.

The predicted values of gene expression biomarkers (i.e. IFNG, CD8, PDL1) are the average values among all members defined by the original publications.

Preprocessing of immunotherapy datasets

The first step is to preprocess 3 immunotherapy datasets with gene expression, TMB and clinical outcome available. APS from Pancan analysis is used to normalize APS here.

Hugo et al 2016 dataset

## Data source: Genomic and Transcriptomic Features of Response to Anti-PD-1 Therapy in Metastatic Melanoma. Cell 2016 Mar 24;165(1):35-44. PMID: 26997480
suppressPackageStartupMessages(library(GEOquery))
library(readxl)
geo_dir = "../data/GEOdata"

gse = "GSE78220"
GSE_78220 = getGEO(gse, GSEMatrix = TRUE, AnnotGPL = TRUE, destdir = geo_dir)
GSE_78220 = GSE_78220$GSE78220_series_matrix.txt.gz

#### Download gene expression using following command
# download.file(url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE78220&format=file&file=GSE78220%5FPatientFPKM%2Exlsx", destfile = "../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM = readxl::read_xlsx("../data/GSE78220_FPKM.xlsx")

GSE78220_FPKM = GSE78220_FPKM %>% 
    as.data.frame() %>% 
    mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>% 
    arrange(Gene, desc(mean_expr)) %>%
    distinct(Gene, .keep_all = TRUE) %>% 
    dplyr::select(-mean_expr) %>% 
    as.tibble()

# pData(GSE_78220) %>% View()

# normalize gene expression as log(x + 1)
GSE78220_Norm = GSE78220_FPKM
GSE78220_Norm[,-1] = log2(GSE78220_Norm[,-1] + 1)

# apply gsva
applyGSVA(merged_geneList, group_col = "Cell_type", 
          gene_col = "Symbol", ExprMatList = list(GSE78220_Norm), method = "gsva") -> gsva.hugo2016

gsva.hugo2016 = calc_TisIIs(gsva.hugo2016[[1]])

# normalize APS
APS_hugo = gsva.hugo2016$APM
APS_hugo = (APS_hugo - min_APS) / (max_APS - min_APS) 

names(APS_hugo) = colnames(GSE78220_Norm)[-1]
APS_hugo
names(APS_hugo) = sub(pattern = "(Pt.*)\\.baseline", "\\1", names(APS_hugo))


# keep biggest value
APS_hugo = APS_hugo[-21]
names(APS_hugo)[c(14, 20)] = c("Pt16", "Pt27")
APS_hugo

# keep corresponding expression data for downstream analysis
rna_hugo = GSE78220_Norm[, -22]
colnames(rna_hugo)[-1] = names(APS_hugo)
save(rna_hugo, file = "results/Hugo2016_RNA.RData")


GSE78220.APS = tibble(PatientID = names(APS_hugo), APS=APS_hugo, IIS = gsva.hugo2016$IIS[-21])

# read clincal and tmb data
GSE_78220.TMB = read_csv("../data/Cell2016.csv", col_types = cols())
GSE_78220.TMB$nTMB = GSE_78220.TMB$TotalNonSyn / 38

info_hugo = full_join(GSE_78220.TMB, GSE78220.APS, by = "PatientID")
info_hugo = info_hugo[,c(1:5, 14:16, 13)]

# calculate TIGS score
info_hugo$TIGS = log(info_hugo$nTMB) * info_hugo$APS

save(info_hugo, file = "results/Hugo2016_Info.RData")

Van Allen et al 2015 dataset

#----------------------------------------------------------
# Data source :
# Science 2015: Genomic correlates of response to CTLA4 blockade in metastatic melanoma
#--------------------------
library(readxl)
suppressPackageStartupMessages(library(tidyverse))
science2015_TMB = read_xlsx("../data/Science2015_TMB_List_AllPatients.xlsx")
science2015_cli1 = read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 1)
science2015_cli2 = read_xlsx("../data/Science2015_Clinical.xlsx", sheet = 2)
science2015_rna = read_tsv("../data/MEL-IPI-Share.rpkm.gct", col_types = cols())

#---- Preprocess
vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                 "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                 "In_Frame_Ins", "Missense_Mutation")

science2015_TMB %>% group_by(patient) %>% 
    summarise(TotalMutation=n(), 
              TotalNonSyn=sum(Variant_Classification %in% vc.nonSilent), 
             nTMB = TotalNonSyn / 38) -> science2015_TMB_summary

#---- remove duplicate symbols of rna data
science2015_rna %>% mutate(symbol = sub("(.*)\\..*$", "\\1", Description)) %>% 
    dplyr::select(-Name, -Description) %>% 
    dplyr::select(symbol, everything()) %>% 
    as.data.frame() %>% 
    mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>% 
    arrange(symbol, desc(mean_expr)) %>%
    distinct(symbol, .keep_all = TRUE) %>% 
    dplyr::select(-mean_expr) %>% 
    as.tibble() -> science2015_ge

length(intersect(science2015_ge$symbol, GSE78220_Norm$Gene))
# the gene number of science paper is bigger than hugo 2016
# maybe miRNA and some other RNA included
# keep them almost equal\
science2015_ge = science2015_ge %>% 
    filter(symbol %in% GSE78220_Norm$Gene)

science2015_ge[, -1] = log2(science2015_ge[, -1] + 1)

applyGSVA(merged_geneList, group_col = "Cell_type", 
          gene_col = "Symbol", 
          ExprMatList = list(science2015_ge), method = "gsva") -> gsva.VanAllen2015

gsva.VanAllen2015 = calc_TisIIs(gsva.VanAllen2015[[1]])

APS_VanAllen = gsva.VanAllen2015$APM
names(APS_VanAllen) = colnames(science2015_ge)[-1]
APS_VanAllen = (APS_VanAllen - min_APS) / (max_APS - min_APS)

APS_VanAllen
names(APS_VanAllen) = sub("^MEL.IPI_(.*)\\.Tumor.*$", "\\1", names(APS_VanAllen))
APS_VanAllen

rna_VanAllen = science2015_ge
colnames(rna_VanAllen)[-1] = names(APS_VanAllen) 
## save gene experssion
save(rna_VanAllen, file = "results/VanAllen2015_RNA.RData")

APS_VanAllen_df = tibble(patient = names(APS_VanAllen), APS = APS_VanAllen, 
                         IIS = gsva.VanAllen2015$IIS)

science2015_cli = bind_rows(
    dplyr::select(science2015_cli1, patient, age_start, RECIST, overall_survival, progression_free, primary, group, 
           histology, stage, gender, dead, progression, neos50),
    dplyr::select(science2015_cli2, patient, age_start, RECIST, overall_survival, progression_free, primary,
           group, histology, stage, gender, dead, progression) %>% filter(patient %in% c("Pat20", "Pat91"))
)
# combine three datasets
info_VanAllen = full_join(full_join(science2015_cli, APS_VanAllen_df, by = "patient"), 
                         science2015_TMB_summary, by="patient")

info_VanAllen$TIGS = info_VanAllen$APS * log(info_VanAllen$nTMB +1) # to avoid TIGS <0, TIGS = log(TMB +1) *APM for this dataset

save(info_VanAllen, file = "results/VanAllen2015_Info.RData")

Snyder et al 2017 dataset

#----------------------------------------------------------------
# Contribution of systemic and somatic factors to clinical response and resistance in urothelial cancer: an exploratory multi-omic analysis
#---------------------------------------------------------------
# Data Source: https://github.com/XSLiuLab/multi-omic-urothelial-anti-pdl1 
# This is a fork repository, original link can also be found in this link

library(tidyverse)
uro_cli = read_csv("../data/data_clinical.csv", col_types = cols())
uro_counts = read_csv("../data/data_counts.csv", col_types = cols())
uro_variants = read_csv("../data/data_variants.csv",
                        col_types = "c??????")
uro_rna = read_csv("../data/data_kallisto.csv", col_types = cols())
uro_rna_wide = reshape2::dcast(uro_rna, gene_name ~ patient_id, value.var = "est_counts")

# try remove duplicated genes
uro_rna_wide %>% 
    as.data.frame() %>% 
    mutate(mean_expr = rowMeans(.[, -1], na.rm = TRUE)) %>% 
    arrange(gene_name, desc(mean_expr)) %>%
    distinct(gene_name, .keep_all = TRUE) %>% 
    dplyr::select(-mean_expr) %>% 
    as.tibble() -> uro_rna_ge

length(intersect(uro_rna_ge$gene_name, GSE78220_FPKM$Gene))
uro_rna_ge = dplyr::filter(uro_rna_ge, gene_name %in% GSE78220_FPKM$Gene)
uro_rna_ge_Norm = uro_rna_ge
uro_rna_ge_Norm[, -1] = log2(uro_rna_ge_Norm[,-1]+1)

## apply GSVA
applyGSVA(merged_geneList, group_col = "Cell_type", 
          gene_col = "Symbol", ExprMatList = list(uro_rna_ge_Norm), method = "gsva") -> gsva.Snyder

gsva.Snyder = calc_TisIIs(gsva.Snyder[[1]])

APS_Snyder = gsva.Snyder$APM
APS_Snyder = (APS_Snyder - min_APS) / (max_APS - min_APS)

## save gene expression
rna_Snyder = uro_rna_ge_Norm
save(rna_Snyder, file = "results/Snyder2017_RNA.RData")

APS_Snyder_df = tibble(patient_id = colnames(uro_rna_ge_Norm)[-1], 
                       APS = APS_Snyder, 
                       IIS = gsva.Snyder$IIS)

uro_cli_2 = uro_cli %>% 
    dplyr::select(patient_id, Age, Sex, is_benefit, is_benefit_os, os, `Alive Status`) %>% 
    mutate(event = ifelse(`Alive Status` == 'Y', 0, 1))

uro_tmb = uro_variants %>% 
    group_by(patient_id) %>% 
    summarise(TMB = n(), Nonsyn = sum(!is.na(gene_name)), nTMB = TMB/38) %>% 
    mutate(patient_id = as.character(patient_id))

info_Snyder = dplyr::full_join(uro_cli_2, dplyr::full_join(uro_tmb, APS_Snyder_df))
info_Snyder$TIGS = info_Snyder$APS * log(info_Snyder$nTMB)
 
save(info_Snyder, file = "results/Snyder2017_Info.RData")

rm(list = ls())

Gene expression biomarkers for response to immune checkpoint blockade

This section we will describe how to calculate predicted value of gene signature biomarker using gene expression data.

Load data firstly.

Check if all gene signatures are in data.

We got all TRUE. Next we retrieve all gene expression values and then calculate average.

Combine them with sigList, then transform these data.frame to long format, calculate mean gene expression and transform back to wide format.

Show results as tables.

Lastly, merge the three result datasets for downstream analysis.

Input file generation for TIDE

TIDE is developed by Shirley Liu lab, software website is available at (http://tide.dfci.harvard.edu/). We can generate input file for TIDE and submit it to the website for TIDE score calculation.

Following suggestions provided at TIDE website, we firstly generate input files for three immunotherapy datasets .

Next we submit the three tsv file to TIDE website. Previous immunotherapy item is set as “No”. The result csv files are saved to report/results directory.

ROC comparison

This section we compare ROC (AUC value) of biomarkers for response prediction.

library(pROC)
#> Type 'citation("pROC")' for a citation.
#> 
#> Attaching package: 'pROC'
#> The following objects are masked from 'package:stats':
#> 
#>     cov, smooth, var
library(tidyverse)
#theme_set(theme_bw())
#library(cowplot)

#------------------------------------------
# Load data
load("results/Hugo2016_Info.RData")
load("results/VanAllen2015_Info.RData")
load("results/Snyder2017_Info.RData")

load("results/GEBiomarker.RData")

# Of note, output csv from TIDE website has extra "" in last line
#          please remove it before load data
TIDE_hugo = read_csv("results/TIDE_output_hugo.csv", col_types = cols())
TIDE_Snyder = read_csv("results/TIDE_output_Snyder.csv", col_types = cols())
TIDE_VanAllen = read_csv("results/TIDE_output_VanAllen.csv", col_types = cols())

#-----------------------------------------
# Merge datasets
info_hugo = info_hugo %>% 
  dplyr::left_join(dplyr::filter(GEBiomarker, study=="Hugo2016") %>% 
                      dplyr::select(-study), by=c("PatientID" = "sample")) %>% 
  dplyr::left_join(TIDE_hugo %>% dplyr::select(Patient, TIDE), by = c("PatientID" = "Patient"))

#------------------------------------------
# Hugo 2016
hugo2016_roc_TMB = roc(info_hugo$Response, info_hugo$nTMB)
hugo2016_roc_APS = roc(info_hugo$Response, info_hugo$APS)
hugo2016_roc_TIGS = roc(info_hugo$Response, info_hugo$TIGS)
hugo2016_roc_IIS = roc(info_hugo$Response, info_hugo$IIS)
hugo2016_roc_CD8 = roc(info_hugo$Response, info_hugo$CD8)
hugo2016_roc_IFNG = roc(info_hugo$Response, info_hugo$IFNG)
hugo2016_roc_PDL1 = roc(info_hugo$Response, info_hugo$PDL1)
hugo2016_roc_TIDE = roc(info_hugo$Response, info_hugo$TIDE)

ggroc(list(TMB=hugo2016_roc_TMB, TIGS=hugo2016_roc_TIGS, TIDE=hugo2016_roc_TIDE), legacy.axes = TRUE) + 
    labs(color = "Predictor") + ggpubr::theme_classic2(base_size = 12, base_family = "Arial")


# AUC of predictors
auc(hugo2016_roc_TMB)   # AUC of TMB
#> Area under the curve: 0.5994
auc(hugo2016_roc_APS)   # AUC of APS
#> Area under the curve: 0.6978
auc(hugo2016_roc_TIGS)  # AUC of TIGS
#> Area under the curve: 0.7692
auc(hugo2016_roc_IIS)   # AUC of IIS
#> Area under the curve: 0.6154
auc(hugo2016_roc_CD8)   # AUC of CD8
#> Area under the curve: 0.5055
auc(hugo2016_roc_IFNG)  # AUC of IFNG
#> Area under the curve: 0.4945
auc(hugo2016_roc_PDL1)  # AUC of PDL1
#> Area under the curve: 0.4286
auc(hugo2016_roc_TIDE)  # AUC of TIDE
#> Area under the curve: 0.783

#---------------------------
# VanAllen 2015

info_VanAllen = info_VanAllen %>% 
  dplyr::left_join(dplyr::filter(GEBiomarker, study=="VanAllen2015") %>% 
                      dplyr::select(-study), by=c("patient" = "sample")) %>% 
  dplyr::left_join(TIDE_VanAllen %>% dplyr::select(Patient, TIDE), by = c("patient" = "Patient"))

info_VanAllen2 = info_VanAllen %>%
    mutate(Response = case_when(
        group == "response" ~ "R",
        group == "nonresponse" ~"NR",
        TRUE ~ NA_character_
    )) %>% filter(!is.na(Response))


## -- using following command if only compare patients with TIGS availble
## -- the result are basically same
# info_VanAllen2 = info_VanAllen %>%
#     mutate(Response = case_when(
#         group == "response" ~ "R",
#         group == "nonresponse" ~"NR",
#         TRUE ~ NA_character_
#     )) %>% filter(!is.na(Response), !is.na(TIGS))


VanAllen2015_roc_TMB = roc(info_VanAllen2$Response, info_VanAllen2$nTMB)
VanAllen2015_roc_APS = roc(info_VanAllen2$Response, info_VanAllen2$APS)
VanAllen2015_roc_TIGS = roc(info_VanAllen2$Response, info_VanAllen2$TIGS)
VanAllen2015_roc_IIS = roc(info_VanAllen2$Response, info_VanAllen2$IIS)
VanAllen2015_roc_CD8 = roc(info_VanAllen2$Response, info_VanAllen2$CD8)
VanAllen2015_roc_IFNG = roc(info_VanAllen2$Response, info_VanAllen2$IFNG)
VanAllen2015_roc_PDL1 = roc(info_VanAllen2$Response, info_VanAllen2$PDL1)
VanAllen2015_roc_TIDE = roc(info_VanAllen2$Response, info_VanAllen2$TIDE)

ggroc(list(TMB=VanAllen2015_roc_TMB,TIGS=VanAllen2015_roc_TIGS, TIDE=VanAllen2015_roc_TIDE),
      legacy.axes = TRUE) + 
    labs(color = "Predictor") + 
    ggpubr::theme_classic2(base_size = 12, base_family = "Arial")


# AUC of predictors
auc(VanAllen2015_roc_TMB)   # AUC of TMB
#> Area under the curve: 0.6748
auc(VanAllen2015_roc_APS)   # AUC of APS
#> Area under the curve: 0.6894
auc(VanAllen2015_roc_TIGS)  # AUC of TIGS
#> Area under the curve: 0.7552
auc(VanAllen2015_roc_IIS)   # AUC of IIS
#> Area under the curve: 0.6708
auc(VanAllen2015_roc_CD8)   # AUC of CD8
#> Area under the curve: 0.6615
auc(VanAllen2015_roc_PDL1)  # AUC of PDL1
#> Area under the curve: 0.6087
auc(VanAllen2015_roc_IFNG)  # AUC of IFNG
#> Area under the curve: 0.6708
auc(VanAllen2015_roc_TIDE)  # AUC of TIDE
#> Area under the curve: 0.7438

#---------------------------------------
# Snyder 2017

info_Snyder = info_Snyder %>% 
  dplyr::left_join(dplyr::filter(GEBiomarker, study=="Snyder2017") %>% 
                      dplyr::select(-study), by=c("patient_id" = "sample")) %>% 
  dplyr::left_join(TIDE_Snyder %>% dplyr::select(Patient, TIDE), by = c("patient_id" = "Patient"))

Snyder2017_roc_TMB = roc(info_Snyder$is_benefit, info_Snyder$nTMB)
Snyder2017_roc_APS = roc(info_Snyder$is_benefit, info_Snyder$APS)
Snyder2017_roc_TIGS = roc(info_Snyder$is_benefit, info_Snyder$TIGS)
Snyder2017_roc_IIS = roc(info_Snyder$is_benefit, info_Snyder$IIS)
Snyder2017_roc_CD8 = roc(info_Snyder$is_benefit, info_Snyder$CD8)
Snyder2017_roc_PDL1 = roc(info_Snyder$is_benefit, info_Snyder$PDL1)
Snyder2017_roc_IFNG = roc(info_Snyder$is_benefit, info_Snyder$IFNG)
Snyder2017_roc_TIDE = roc(info_Snyder$is_benefit, info_Snyder$TIDE)



ggroc(list(TMB=Snyder2017_roc_TMB, TIGS=Snyder2017_roc_TIGS, TIDE=Snyder2017_roc_TIDE),
      legacy.axes = TRUE) + 
    labs(color = "Predictor") + ggpubr::theme_classic2(base_size = 12, base_family = "Arial")

From comparison of AUC value above, we can conclude that:

  • TIGS and TIDE are better than TMB
  • TIGS and TIDE have almost same performance in two melanoma datasets, however, performance of TIDE go down a little in urothelial cancer dataset, maybe resulting from the fact that TIDE is trained on data of melanoma and NSCLC cancer (this point is stated at TIDE website, cancer type except for melanoma and NSCLC may not work).
  • APS, TMB are also not bad predictors (they have almost same performance)
  • Other predictors are not good predictors, they perform differently in different datasets.

Next, we summary results above for final visualization.

Survival comparison

Above we already show APS, TMB, TIGS and TIDE are good predictors, this section we focus on them. We use cox model and separate patients into High and Low group in K-M survival analysis based on median biomarker predicted value.

library(survival)
library(survminer)
#> Loading required package: ggpubr
#> Loading required package: magrittr
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

#----------------------
# Hugo 2016
hugo_os = read_csv("../data/cell2016_OS.csv", col_types = cols())
info_hugo = full_join(info_hugo, hugo_os, by="PatientID")

# Cox model

# Cox result of TMB
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     nTMB, data = info_hugo)
#> 
#>          coef exp(coef) se(coef)      z     p
#> nTMB -0.03045   0.97001  0.02179 -1.397 0.162
#> 
#> Likelihood ratio test=3.05  on 1 df, p=0.08059
#> n= 37, number of events= 19 
#>    (1 observation deleted due to missingness)

# Cox result of APS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     APS, data = info_hugo)
#> 
#>        coef exp(coef) se(coef)      z     p
#> APS -1.9820    0.1378   1.4845 -1.335 0.182
#> 
#> Likelihood ratio test=1.84  on 1 df, p=0.1755
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIGS, data = info_hugo)
#> 
#>         coef exp(coef) se(coef)      z      p
#> TIGS -1.2140    0.2970   0.5536 -2.193 0.0283
#> 
#> Likelihood ratio test=5.9  on 1 df, p=0.01515
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)
#> Call:
#> coxph(formula = Surv(OverallSurvival, VitalStatus == "Dead") ~ 
#>     TIDE, data = info_hugo)
#> 
#>        coef exp(coef) se(coef)    z     p
#> TIDE 0.1668    1.1815   0.3086 0.54 0.589
#> 
#> Likelihood ratio test=0.29  on 1 df, p=0.5919
#> n= 26, number of events= 12 
#>    (12 observations deleted due to missingness)


# K-M fit
info_hugo %>%  # take care of NA values
  mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High", 
                        ifelse(is.na(APS), NA, "Low")),
    TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High",
                         ifelse(is.na(TIDE), NA, "Low")),
    TMB_Status = ifelse(nTMB>median(nTMB), "High", "Low"),
    TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
                         ifelse(is.na(TIGS), NA, "Low"))) -> info_hugo2

# effect of TMB status on survival 
fit1 = survfit(Surv(OverallSurvival, VitalStatus == "Dead") ~ TMB_Status, data = info_hugo2)
ggsurvplot(fit1, data = info_hugo2, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)")


#----------------------
# Van Allen 2015

# Cox model

# Cox result of TMB
coxph(Surv(overall_survival, dead== 1) ~ nTMB, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ nTMB, data = info_VanAllen)
#> 
#>           coef exp(coef)  se(coef)      z     p
#> nTMB -0.007145  0.992880  0.005635 -1.268 0.205
#> 
#> Likelihood ratio test=1.94  on 1 df, p=0.1632
#> n= 110, number of events= 83 
#>    (2 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(overall_survival, dead== 1) ~ APS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ APS, data = info_VanAllen)
#> 
#>        coef exp(coef) se(coef)      z      p
#> APS -1.6820    0.1860   0.7618 -2.208 0.0273
#> 
#> Likelihood ratio test=5.23  on 1 df, p=0.0222
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(overall_survival, dead== 1) ~ TIGS, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIGS, data = info_VanAllen)
#> 
#>         coef exp(coef) se(coef)      z     p
#> TIGS -0.5135    0.5984   0.2676 -1.919 0.055
#> 
#> Likelihood ratio test=4.68  on 1 df, p=0.03058
#> n= 40, number of events= 27 
#>    (72 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(overall_survival, dead== 1) ~ TIDE, data = info_VanAllen)
#> Call:
#> coxph(formula = Surv(overall_survival, dead == 1) ~ TIDE, data = info_VanAllen)
#> 
#>        coef exp(coef) se(coef)     z       p
#> TIDE 0.5358    1.7088   0.1813 2.955 0.00313
#> 
#> Likelihood ratio test=8.22  on 1 df, p=0.004134
#> n= 42, number of events= 29 
#>    (70 observations deleted due to missingness)

info_VanAllen %>% 
  filter(group != "long-survival") %>% 
  mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High", 
                             ifelse(is.na(APS), NA, "Low")),
         TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High", 
                              ifelse(is.na(TIDE), NA, "Low")),
         TMB_Status = ifelse(nTMB>median(nTMB, na.rm = TRUE), "High",
                              ifelse(is.na(nTMB), NA, "Low")),
         TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
                              ifelse(is.na(TIGS), NA, "Low")))  -> info_VanAllen3

# effect of TMB status on survival 
fit1 = survfit(Surv(overall_survival, dead== 1) ~ TMB_Status, data = info_VanAllen3)
ggsurvplot(fit1, data = info_VanAllen3, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)")


#----------------------
# Snyder 2017

# Cox model

# Cox result of TMB
coxph(Surv(os, event) ~ nTMB, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ nTMB, data = info_Snyder)
#> 
#>          coef exp(coef) se(coef)      z     p
#> nTMB -0.02115   0.97907  0.02151 -0.983 0.325
#> 
#> Likelihood ratio test=1.1  on 1 df, p=0.2946
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of APS
coxph(Surv(os, event) ~ APS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ APS, data = info_Snyder)
#> 
#>       coef exp(coef) se(coef)      z    p
#> APS -1.010     0.364    1.104 -0.915 0.36
#> 
#> Likelihood ratio test=0.85  on 1 df, p=0.3576
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)

# Cox result of TIGS
coxph(Surv(os, event) ~ TIGS, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIGS, data = info_Snyder)
#> 
#>         coef exp(coef) se(coef)     z     p
#> TIGS -0.4863    0.6149   0.3450 -1.41 0.159
#> 
#> Likelihood ratio test=2.39  on 1 df, p=0.1219
#> n= 22, number of events= 14 
#>    (6 observations deleted due to missingness)

# Cox result of TIDE
coxph(Surv(os, event) ~ TIDE, data = info_Snyder)
#> Call:
#> coxph(formula = Surv(os, event) ~ TIDE, data = info_Snyder)
#> 
#>         coef exp(coef) se(coef)      z     p
#> TIDE -0.3926    0.6753   0.3332 -1.178 0.239
#> 
#> Likelihood ratio test=1.44  on 1 df, p=0.2301
#> n= 25, number of events= 17 
#>    (3 observations deleted due to missingness)


info_Snyder %>% 
  mutate(APS_Status = ifelse(APS>median(APS, na.rm = TRUE), "High", 
                             ifelse(is.na(APS), NA, "Low")),
         TIDE_Status = ifelse(TIDE>median(TIDE, na.rm = TRUE), "High", 
                              ifelse(is.na(TIDE), NA, "Low")),
         TMB_Status = ifelse(nTMB>median(nTMB, na.rm = TRUE), "High",
                              ifelse(is.na(nTMB), NA, "Low")),
         TIGS_Status = ifelse(TIGS>median(TIGS, na.rm = TRUE), "High",
                              ifelse(is.na(TIGS), NA, "Low")))  -> info_Snyder2

# effect of TMB status on survival 
fit1 = survfit(Surv(os, event) ~ TMB_Status, data = info_Snyder2)
ggsurvplot(fit1, data = info_Snyder2, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)")

From comparison of cox results and K-M plots above, we can conclude that:

  • TIGS is very robust, in all three datasets, “High group” all show statistically significant better survival than “Low group”.
  • TIDE performs just so-so in Hugo 2016 dataset, very good in Van Allen 2015 dataset. But, it losts its power in Snyder 2017 dataset, patients with low TIDE value should have better survival but in this dataset patients with high TIDE value have better survival.
  • Patients with High TMB or APS in three datasets all show trends of better survival but not statistically significant.

Next, we summary results (cox HR value and K-M fit) above for final visualization.

# summary cox result
cox_TMB = list()
cox_APS = list()
cox_TIGS = list()
cox_TIDE = list()

summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ nTMB, data = info_hugo)) -> cox_TMB[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ APS, data = info_hugo)) -> cox_APS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIGS, data = info_hugo)) -> cox_TIGS[[1]]
summary(coxph(Surv(OverallSurvival, VitalStatus == "Dead") ~ TIDE, data = info_hugo)) -> cox_TIDE[[1]]

summary(coxph(Surv(overall_survival, dead== 1) ~ nTMB, data = info_VanAllen)) -> cox_TMB[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ APS, data = info_VanAllen)) -> cox_APS[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ TIGS, data = info_VanAllen)) -> cox_TIGS[[2]]
summary(coxph(Surv(overall_survival, dead== 1) ~ TIDE, data = info_VanAllen)) -> cox_TIDE[[2]]

summary(coxph(Surv(os, event) ~ nTMB, data = info_Snyder)) -> cox_TMB[[3]]
summary(coxph(Surv(os, event) ~ APS, data = info_Snyder))  -> cox_APS[[3]]
summary(coxph(Surv(os, event) ~ TIGS, data = info_Snyder)) -> cox_TIGS[[3]]
summary(coxph(Surv(os, event) ~ TIDE, data = info_Snyder)) -> cox_TIDE[[3]]

cox_df = tibble(
  Group = c(rep("Hugo2016",4), rep("VanAllen2015",4), rep("Snyder2017",4)),
  Biomarker = rep(c("TMB", "APS", "TIGS", "TIDE"), 3),
  HR = c(
    as.numeric(cox_TMB[[1]]$conf.int[1]),
    as.numeric(cox_APS[[1]]$conf.int[1]),
    as.numeric(cox_TIGS[[1]]$conf.int[1]),
    as.numeric(cox_TIDE[[1]]$conf.int[1]),
    as.numeric(cox_TMB[[2]]$conf.int[1]),
    as.numeric(cox_APS[[2]]$conf.int[1]),
    as.numeric(cox_TIGS[[2]]$conf.int[1]),
    as.numeric(cox_TIDE[[2]]$conf.int[1]),
    as.numeric(cox_TMB[[3]]$conf.int[1]),
    as.numeric(cox_APS[[3]]$conf.int[1]),
    as.numeric(cox_TIGS[[3]]$conf.int[1]),
    as.numeric(cox_TIDE[[3]]$conf.int[1])
  ),
  Pvalue = c(
    as.numeric(cox_TMB[[1]]$logtest["pvalue"]),
    as.numeric(cox_APS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[1]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[1]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[2]]$logtest["pvalue"]),
    as.numeric(cox_APS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[2]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[2]]$logtest["pvalue"]),
    as.numeric(cox_TMB[[3]]$logtest["pvalue"]),
    as.numeric(cox_APS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIGS[[3]]$logtest["pvalue"]),
    as.numeric(cox_TIDE[[3]]$logtest["pvalue"])
  )
)

# show summary data as table
DT::datatable(cox_df, 
              options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE, 
              caption = "HR of biomarkers for response prediction in three ICB datasets")

Visualization of comparison profile

This section we visualize comparsion profile, integrate analysis results of 3 immunotherapy datasets into a big figure.

Integrate ROC curve.

#--- KM-plots
p_surv_APS = list()
p_surv_TMB = list()
p_surv_TIGS = list()
p_surv_TIDE = list()

p_surv_APS[[1]] = ggsurvplot(km_APS$VanAllen, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))
p_surv_APS[[2]] = ggsurvplot(km_APS$Hugo, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))
p_surv_APS[[3]] = ggsurvplot(km_APS$Snyder, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("APS High", "APS Low"))

p_surv_TMB[[1]] = ggsurvplot(km_TMB$VanAllen, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))
p_surv_TMB[[2]] = ggsurvplot(km_TMB$Hugo, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))
p_surv_TMB[[3]] = ggsurvplot(km_TMB$Snyder, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TMB High", "TMB Low"))

p_surv_TIGS[[1]] = ggsurvplot(km_TIGS$VanAllen, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))
p_surv_TIGS[[2]] = ggsurvplot(km_TIGS$Hugo, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.4), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))
p_surv_TIGS[[3]] = ggsurvplot(km_TIGS$Snyder, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIGS High", "TIGS Low"))

p_surv_TIDE[[1]] = ggsurvplot(km_TIDE$VanAllen, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
p_surv_TIDE[[2]] = ggsurvplot(km_TIDE$Hugo, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
p_surv_TIDE[[3]] = ggsurvplot(km_TIDE$Snyder, pval = TRUE, fun = "pct", 
           xlab = "Time (in days)", palette = c("red", "blue"), 
           legend = c(0.8, 0.9), legend.title=element_blank(),legend.labs = c("TIDE High", "TIDE Low"))
  

#ggpar(p , font.legend = c(12, "bold", "black"))

Shixiang Wang

2018-12-06